Global Document Frequency Estimation in Peer-to-Peer Web Search
Abstract
Information retrieval (IR) in peer-to-peer (P2P) networks, where the corpus is spread across many loosely coupled peers, has recently gained importance. In contrast to IR systems on a centralized server or server farm, P2P IR faces an additional challenge: it must either remain oblivious to global corpus statistics or compute the global measures from the local statistics at the individual peers in an efficient, distributed manner. One specific measure of interest is the global document frequency of each term, which is useful as a term-specific weight in the scoring and ranking of merged search results obtained from different peers. This paper presents an efficient solution to the problem of estimating global document frequencies in a large-scale P2P network with very high dynamics, where peers can join and leave the network on short notice. In particular, the developed method takes into account the fact that the local document collections of autonomous peers may overlap arbitrarily, so that global counting needs to be duplicate-insensitive. The method is based on hash sketches as a technique for compact data synopses. Experimental studies demonstrate the estimator’s accuracy, scalability, and ability to cope with high dynamics. Moreover, the benefit for ranking P2P search results is shown by experiments with real-world Web data and queries.
Similar resources
Overlap-Aware Global df Estimation in Distributed Information Retrieval Systems
Peer-to-Peer (P2P) search engines and other forms of distributed information retrieval (IR) are gaining momentum. Unlike in centralized IR, it is difficult and expensive to compute statistical measures about the entire document collection as it is widely distributed across many computers in a highly dynamic network. On the other hand, such network-wide statistics, most notably, global document ...
Document Clustering for Distributed Fulltext Search
Recent research efforts in peer-to-peer (P2P) systems concentrate on providing a “distributed hash table”-like primitive in the P2P system (Stoica et al., 2001). However, to make P2P systems useful, we need to build a keyword search engine to index the entire document collection in the distributed system. Doing keyword search in a distributed environment poses new challenges for traditional inf...
Building a peer-to-peer full-text Web search engine with highly discriminative keys
Web search engines designed on top of peer-to-peer (P2P) overlay networks show promise to enable attractive search scenarios operating at a large scale. However the design of effective indexing techniques for extremely large document collections still raises a number of open technical challenges. Resource sharing, self-organization, and low maintenance costs are favorable properties of P2P over...
ODISSEA: A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval
We consider the problem of building a P2P-based search engine for massive document collections. We describe a prototype system called ODISSEA (Open DIStributed Search Engine Architecture) that is currently under development in our group. ODISSEA provides a highly distributed global indexing and query execution service that can be used for content residing inside or outside of a P2P network. ODI...
The ALVIS Document Model for a Semantic Search Engine
ALVIS researches the design, use and interoperability of topic-specific search engines with the goal of developing an open source prototype of a peer-to-peer, semantic-based search engine. Our approach is not the traditional Semantic Web approach with coded meta-data, but rather an engine that can build on content through semi-automatic analysis. This paper describes the ALVIS document processi...